Everything sticks better when you’re coding towards a goal. We remember the things we do better when they work towards a logical, progressive narrative, rather than a jumbled set of skills loosely thrown together. As such, we’re all going to be working together towards answering a few specific questions with this dataset:
TODAY’S QUESTIONS:
1.) Does sampling probability lead people to choose safe choices more frequently?
2.) Do people make choices faster when deciding within a single domain versus domains including both losses and gains?
3.) Do people make choices faster across different combinations of problem types and conditions?
4.) Does expected value predict response time?
5.) Does expected value influence the probability of choosing the safe option?
Furthermore, I’m going to outline the stages that we can expect to see throughout that process. Some of these might be a little less meaningful if you don’t have a solid statistical background, but you should nonetheless be able to follow along in practice, if not in theory. This is not an exercise in statistical knowledge; it is an exercise in coding, but the content just happens to be statistical.
A fair warning
This is an R Markdown script. Much like a typical R script, it contains code documenting your process, such that any other individual could reproduce your work. However, unlike typical R scripts, R Markdown allows us to narrate the process (much as I’m doing now). The different types of text (narration and code) usually appear in different colors so you can tell the difference; though, what color it appears as for you might differ based on your settings.
A typical R script will interpret any text that you include as a command to be carried out, unless otherwise marked by a hashtag (#). The output will appear in your R Studio Console Window. By output, we just mean the product, sum, or status of whatever calculation or item you are asking R to compute or show for you.
R Markdown can be a little more complicated, but you also have a lot of control over how commands, text, and output are displayed. Ultimately, because of this additional control, I find R Markdown scripts more intuitive and easy to follow (though not everyone feels the same). The output of any commands you enter into an R Markdown script typically appears just below the script that created it. We can read it in the Console if we prefer, but it’s nice having all the information we want in one window (this one, if you’re reading this as a script). We can also minimize output and text that we don’t want to look at in the moment by clicking on the arrow next to the line count on the left side of the window [ ].
In order to tell R when you want it to treat something as a command in R Markdown, you use these backtick (```) symbols (usually upper left corner of the keyboard) followed by “{r }”. Below that, we enter the command(s) we want to run, and then we close it all out with those backticks again. Next to “{r }”, we can add descriptions of what is happening within that block. Let’s look at this example below :
# This is an example of a command. I use the three little dash marks, followed by {r } to let R know I want to give it a command. I can describe what the command is after the r (Like I did with "Example"), though, it's not necessary to do so. Also, notice, much like a typical *R script*, I can use a # to let R know that this text should not be consider a command. If I were to press "Enter" and go down a line, I can have it start paying attention again and calculate the sum of 2 and 4.
Value <- 2 + 4
Value
## [1] 6
# to calculate that line in *R Markdown*, I can press COMMAND + ENTER, or to run all of the commands within this block, I can press the green right facing arrow in the upper right corner of the block.
If you pressed the “Run” command [ ], hopefully you saw a little box pop up just below your block that said “[1] 6”. If you want to export the output, whatever it may be, to a new window, simply click the first icon of the the three in the upper right corner of the output box [
]. To minimize, click the second [
], and to delete, press the third [
]. Note that deleting the output does not affect the related code.
Also of note, the down-facing arrow (second icon in the upper right corner of the code block) will tell R “Run all of the blocks of command that I have before this block” [ ]. It can be helpful if you make a mistake and don’t want to rerun all of the previous blocks one by one to get back to where you were. It also makes your code very easy for other people to run. They can quite literally do it with the click of a button! If we click the cog icon in the same tray, we can access the output options and manipulate where output appears and what it looks like, but that’s beyond the scope of this review [
]. Let’s go ahead and try it below; if you click on the “Run all” function for the Example 3 block, Example 1 and Example 2 should rerun and run, respectively:
(Value * 3) + 2
## [1] 20
(Value/10)
## [1] 0.6
R Markdown scripts will allow us, once we are done, to compile all of our work, or “knit” it, and share it as either an interactive .pdf or .html document which others can follow. They are very helpful if you want to run a stat coding blog, or, say, teach a workshop on how to code in R. I’m not going to go much deeper into the setup of R Markdown, but I have included helpful references in the course materials for those interested.
Of note for those familiar with R, but new to R Markdown: A few functions work differently in R Markdown, relative to an R script. For example, in a typical R script, we could set our working directory using the setwd() function and it would remain the working directory until otherwise specified. R Markdown does not play nice with the setwd command for whatever reason, so I’ve found the simplest option is to just recall the file pathway whenever you want to load or save a file. The others aren’t super relevant for this presentation, but something to keep in mind if anyone gets converted to the cult of R Markdown. Additional details about R Markdown and it’s functionality can be found in this .pdf, but aren’t necessary for this bootcamp tutorial.
Now that we’ve got that out of the way, let’s get started!
In this study, participants were asked to make timed decisions between two sets of probabilistic outcomes. For example, participants may have been asked to choose which of the two would be preferable: a situation where they have a 10% chance of winning $10 dollars (and, necessarily, a 90% chance of winning nothing), or a 40% chance of winning $2 dollars (and a 60% chance of winning nothing). These problems differed in how much participants could win or lose and the probability of those outcomes. Additionally, participants completed this task twice: once following the summary just described and another time where they were not explicitly told the probabilities of winning or losing, but instead were tasked with sampling from the different outcomes to make their decision.
An approximation of the description-based decision-making task
We’ve encountered packages, or something akin to packages, in the previous workshops. You can think of them as add-ons. If we start playing with R right out of the box, with no add-ons, it has a lot of useful tools. We call this Base R. You can do almost anything you can think of with Base R, but it’s not always the most efficient way to do things. Because is open source, meaning it’s free to use and collectively developed, users have programmed collection of shortcuts in the form of these packages. So, in that sense, each package just contains one or more commands not available in Base R (add-ons!).
Every new package is centralized in R’s repository, so even though thousands of people are working on these things independently, you don’t need to leave R to find them. Before they can be used, they must be installed, and you can do that pretty simply:
install.packages("PackageName")
Once a package is installed, it does not need to be installed again. Rerunning the code above will result in updating the given package. If you’re using R Studio, you can also see a list of your packages and their associated descriptions in the ‘Packages’ Tab of your Viewer Window.
Packages tab of viewer window where one can visualize previously installed packages
However, just because a package is installed, it doesn’t mean it’s initialized. Sometimes different packages use the same commands. Sometimes packages take up a lot of disk space. For these reasons, R asks you to tell it what packages you want to use every time you start a new code. We’re going to do that by specifying:
library(PackageName)
Notice that we drop the quotation marks now. We just specify the (case-sensitive) package name and it let’s R know we are planning on using that this session. If we ever want to explore the functions contained within a package in conjunction with examples, we can either go to the R documentation website or type ‘??PackageName’ into the Console, which will then populate the Help Tab of the Viewer Window with information on the package.
Install and load the following packages:
effects
ggpubr
ggthemes
interactions
psych
plyr
RColorBrewer
report
rstatix
sjPlot
sjstats
skimr
stargazer
tidyverse
'
'
Like any other language or program, R needs to be told where the data that we’d like to work with is located on our computer. We can create a new variable to make this process simple. I’m using a Windows computer, so nearly everything is contained within my C:/ Drive. If I wasn’t sure of my path, R makes it relatively easy to find it. If I start by entering this:
Path <- "C:/"
I can press tab when my cursor is to the left of the forward slash to see a list of directories contained within my C:/ Drive. Here’s an example of what you should see:
An example of R’s Tab-Controlled Dropdown Menus
From there, I can keep hitting tab until I get to the directory, or folder, that contains the files I want to work with.
Path <- "C:/Users/Administrator/Desktop/Grad School/Bootcamp/"
Again, because R Markdown doesn’t handle setting working directories like normal R, we’ll use the command that actually specifies the working directory when we load in the data.
Before we load in the data, I want to highlight a little terminology. The data that R works with is always contained within what we call a ‘dataframe’. You’ve likely heard this term in the past workshops. It’s quite literally what it sounds like: a framework in which to situate your data. It contains many cells that are situated into columns (which have names) and rows (which may or may now have names). Think of an excel sheet, except instead of an effectively infinite void of blank columns and rows which you could scroll or write into at your convenience, in R and similar languages, we need to clearly denote the boundaries (how many columns and rows) the data is contained within. We do this whenever we create a dataframe from scratch, or we read in pre-existing data. Because this workshop builds on the previous ones, we don’t need to do the former; we can just ‘read in’ pre-existing data.
There are many ways to load data into R and they all depend upon what format the data is in. R can handle data from .csv, .xlsx, .txt, .html, .json, SPSS, Stata, SAS, among others. R also has it’s own data format (.RDA, .Rdata). With the exception of .RDA, .csv is often the cleanest means of reading in data. We won’t cover the other formats, but they are fairly exhaustively covered in this tutorial.
#Notice we are setting our working directory
setwd(Path)
df <- read.csv(file = "df.csv",
header = T,
sep=",",
na.strings=c("","N/A","NA"),
stringsAsFactors = TRUE)
This may look a little intimidating to start, so let’s break it down piece by piece.
If you remember from the filepath variable above, the less-than and hyphen symbols together (<-) is how we denote a value to an object in R. The object taking the value goes on the left, the value we are assigning goes on the right. Think of the symbols like an arrow. In this case, we are saving the data that we are reading in as an object (really a dataframe) aptly named ‘df’.
Next, we are specifying the command. The command always sits outside of the parentheses. In this case, the command is read.csv(), which is pretty self-explanatory. Then the arguments that a command can understand and which modify how the command functions are contained within the parentheses. If you ever want to see a list of arguments that a command can understand, you can either:
1.) Type the command name into the Console or Source Window script, press Tab within the parentheses and
use the arrow keys to navigate the subsequent dropdown menu that appears.
2.) Enter ?CommandName in the Console and read documentation on it in the Viewer Window.
Option 1 is more convenient and probably a better choice for a command you are already familiar with. Option 2 is more exhaustive and great for a command you are still trying to understand.
A visualization of what you will see if you press Tab to view valid arguments for a given command. Notice that hovering over an argument results in a pop out window which briefly describes the argument.
The first command is to tell R the name of the file in the format of a string. The second is telling R that it is true that the top row of the .csv document contains column names. The third is telling R that the symbol separating different cells is a comma (This is usually a given with .csv, which stand for “Comma-Separated Values”). Next, we are telling R that any cells that contain empty values or N/A values should be treated as N/A. Lastly, we are telling R anytime a column contains only characters, structure it as a factor (we’ll tackle exactly what a factor is in a moment, but for now, just know that it saves us time later.) Notice that these arguments are separated by commas. Also notice that we can put each of these arguments on their own lines, or leave them on the same line and R treats them the same.
All of that being said, let’s go ahead and read in data.
If done correctly, we should see our Environment populate with a dataframe labeled df. Go ahead and click on df to view it. You should see it pop up in your Source Window (The same window you are likely writing script in).
A visualization of the Environment Window. Note that the number of observations and variables may be different from the dataframe you are currently reading in.
Alternatively, if you only want to check the top of it, you can run:
head(df)
## X TrialNumber Condition Run P1_1 P1_2 SafeOption P2_2 O2_2 O2_1
## 1 1 1 Descriptive Run1 0.20 1.00 2 0.00 0 0
## 2 2 1 Experiential Run1 0.60 1.00 2 0.00 0 -12
## 3 3 1 Descriptive Run1 0.20 1.00 2 0.00 0 0
## 4 4 1 Experiential Run2 0.30 1.00 2 0.00 0 -4
## 5 5 1 Descriptive Run2 0.75 0.75 1 0.25 30 10
## 6 6 1 Experiential Run1 0.60 1.00 2 0.00 0 -12
## ProbNumber O1_2 O1_1 EV2 EV1 ProbType P2_1 ran order trialStartTime
## 1 SD12 6 29 6.00 5.8 SD 0.80 1 0 0.0304285
## 2 MD12 12 30 12.00 13.2 MD 0.40 1 0 5.0438600
## 3 SD12 6 29 6.00 5.8 SD 0.80 1 0 0.0304285
## 4 MD11 5 27 5.00 5.3 MD 0.70 1 0 3.7146200
## 5 RR3 -5 2 3.75 4.0 RR 0.25 1 0 0.0383165
## 6 MD12 12 30 12.00 13.2 MD 0.40 1 0 5.0438600
## viewingEndTime viewingDur_total thinkOnset decideOnset leftRight resp
## 1 6.07381 6.04338 6.07412 9.10662 0 1
## 2 NA NA 21.68700 24.71810 1 2
## 3 6.07381 6.04338 6.07412 9.10662 0 1
## 4 NA NA 20.31290 23.34310 1 2
## 5 6.07235 6.03403 6.07269 9.10551 0 1
## 6 NA NA 21.68700 24.71810 1 2
## safeChoice resp_onset rt decideDur_total assignedITI decide_pad
## 1 1 10.800030 1.6934143 11.0696 1.7448 0.006585648
## 2 0 25.291534 0.5734797 20.5477 1.4961 1.126520395
## 3 1 10.800030 1.6934143 11.0696 1.7448 0.006585648
## 4 0 23.754299 0.4112276 20.3397 1.2297 1.288772464
## 5 0 9.839894 0.7343864 10.1016 1.8199 0.965613604
## 6 0 25.291534 0.5734797 20.5477 1.4961 1.126520395
## totalITI ITI_onset TrialEndTime samplingCount_total samplingCount_risky
## 1 1.75139 11.1169 12.8734 NA NA
## 2 2.98036 25.8023 28.7848 24 14
## 3 1.75139 11.1169 12.8734 NA NA
## 4 2.92112 24.2644 27.1936 26 9
## 5 2.78551 10.1580 12.9552 NA NA
## 6 2.98036 25.8023 28.7848 24 14
## samplingCount_safe switchCount samplingDur_assigned samplingDur_total
## 1 NA NA NA NA
## 2 10 3 17 16.6424
## 3 NA NA NA NA
## 4 17 2 17 16.5975
## 5 NA NA NA NA
## 6 10 3 17 16.6424
## samplingEndTime sampling_pad PID
## 1 NA NA A001
## 2 21.6862 0.357744 A001
## 3 NA NA A001
## 4 20.3121 0.402647 A001
## 5 NA NA A001
## 6 21.6862 0.357744 A001
read.csv() contains a command that will read the names of our column headers and check them for syntactically valid variable names and that there are no duplicates. Using one of the two methods highlighted above, find what that argument is and rewrite the data-reading command with that argument included.
'
'
We now have data contained within an R dataframe in a way that R can understand. However, we aren’t interested in all of the data in that dataframe. We will only use a small number of the total variables within our analyses. Let’s go over those variables of interest and define them. Don’t worry too much about memorizing these or understanding how they fit together just yet; just know that they are here in case you need to come back and reference them:
1.) PID - Factor; Participant ID
2.) Condition - Factor; whether participants had probabilities described to them or were asked to sample them.
3.) Run - Factor; Which run the trial was completed within.
4.) ProbType - Factor; defines what type of choice participants had to make: Risky Risks (RR) presented two possible outcomes
with neither one fully guaranteed to occur. Single Domain (SD) decisions presented two possible outcomes
that were either both losses or both gains; one guaranteed to occur. Mixed Domain (MD) decisions presented one
possible outcome that was a loss and one that was a gain; one guaranteed to occur.
5.) order - Numeric; The order in which problems were presented
6.) EV1 - Numeric; Expected value of Option 1 (Probability * Value)
7.) EV2 - Numeric; Expected value of Option 2 (Probability * Value)
8.) samplingCount_risky - Numeric; Total number of risky option samples acquired
9.) samplingCount_safe - Numeric; Total number of safe option samples acquired
10.) switchCount - Numeric; The number of times a subject ‘switched’ sampling machines during the SAMPLING phase.
A switch count of 0 would mean they only sampled from one machine.
11.) safeChoice - Factor; Whether the subject chose the ‘safe’ option. 0 = safe, 1 = risky
12.) rt - Numeric; A calculation of reaction time of decision
We totally could just progress with the dataframe as is … BUT for the sake of demonstration, let’s say we want to clean our space by getting rid of variables that aren’t those we highlighted above. It’s a pretty simple process to do that using Base R’s subset() function. Basically, there are three informal components to a subset() function: 1.) Source Dataframe, wherein we just specify which dataframe we are going to be modifying, 2.) Conditionals, wherein we tell R the conditions that must be met for a row’s data to be carried forward, and 3.) Column Selection, wherein we tell R exactly which columns we want to carry forward to the new dataframe. The formatting looks something like this:
df <- subset(x = DataframeName,
DataframeName$ColumnName == Value,
select = c("ColumnName1", "ColumnName1"))
There is a new command or two contained within this subset() command. Let’s go over these.
On the second line, we see this statement:
DataframeName$ColumnName == Value
This statement contains two points I want to highlight.
1.) Whenever we place a dollar sign ($) following a dataframe, it allows us to tell R that we want to focus on one specific column within that dataframe. It’s an extremely convenient function because no matter where that column is placed, we can be sure R will identify the right column (providing we spell it right; note that R is case-sensitive).
2.) Second, we see not one, but two equals signs (==). When two value operators (=, >, <, !) are placed next to each other in R, and many other languages, we aren’t assigning a value to an object; we are comparing the values between two different objects. In this instance, using two equals signs, if the two values are equal, it would produce a TRUE value; if not, then a FALSE. This variable which can only take the value of either True or False is called a boolean. When we tell R to compare the value on the right with this specific column, what it is mechanically doing is iterating through each row within this column, comparing the column present, and noting whether the conditional is True or False. In this way, it’s very similar to For Loops in other languages.
We also see “select = c(…)”. The select argument just tells subset() which columns to carry forward. The c() function, though, concatenates or combines multiple values together into a single variable or object. Let’s try it the wrong way first to drive the point home.
("Bingo", "Bongo")
## Error: <text>:1:11: unexpected ','
## 1: ("Bingo",
## ^
You get an error message because R doesn’t know what to do with these two items. However, if we concatenate…
c("Bingo", "Bongo")
## [1] "Bingo" "Bongo"
It links the two items together.
Let’s subset the df dataframe with two conditional statements and selecting only the columns we had highlighted above. The conditional statements should ask R to only carry forward rows where the value of the variable ran equals 1 AND rows where the value of safeChoice is not N/A. The ran variable notes whether a trial actually ran or was skipped, wherein 0 means it was skipped and 1 means it ran. To link two conditional statements together, we put an ampersands (&) between them. To tell whether a cell has a value of N/A, we modify the is.na() command by adding a bang or exclamation point (!) in front of it. (i.e., !is.na()) We only want to analyze trials that actually ran and have an outcome value. Save the new dataframe as an object named df (NOTE:We often don’t want to overwrite our old dataframe names, but in this case, it’s fine).
'
'
I’ve used terms factor and numeric to describe variables a few times now, but I haven’t really explained what those mean, or why they are important, and they matter much more in R than other coding languages, at least in my experience. These terms refer to the structure that the data exists in. Data that is structured numerically (including numbers and integers, or “int”) can be subjected to mathematical operations (e.g., addition subtraction, etc.), but non-numeric data cannot. You can imagine a series of numbers as an example. On the other hand, factors are categorical variables that take on a limited number of different values. Likert scale responses (i.e., Not at all, Somewhat, Greatly), which can be hierarchically organized, but lack clear objective values are examples. Certain functions only work properly with certain data structures. “sum("Bingo", "Bongo")” wouldn’t work for obvious reasons, but other incongruencies are much more subtle. Sometimes R incorrectly assumes what structure a variable is; if we assign numeric values to “Yes/No” outcome variables, R will assume they are numeric. As such, it is very important that we accurately structure our data. I have highlighted the two primary data classes that we’ll be working with, but there are many more, which you can read more about here, should you be interested.
There’s a very easy way to assess the structure of our data all at once using the str() command:
str(df)
## 'data.frame': 136 obs. of 12 variables:
## $ PID : Factor w/ 1 level "A001": 1 1 1 1 1 1 1 1 1 1 ...
## $ Condition : Factor w/ 2 levels "Descriptive",..: 1 2 1 2 1 2 1 2 1 2 ...
## $ Run : Factor w/ 2 levels "Run1","Run2": 1 1 1 2 2 1 2 2 1 1 ...
## $ ProbType : Factor w/ 3 levels "MD","RR","SD": 3 1 3 1 2 1 2 1 1 2 ...
## $ order : int 0 0 0 0 0 0 0 0 1 1 ...
## $ EV1 : num 5.8 13.2 5.8 5.3 4 13.2 4 5.3 3.5 4 ...
## $ EV2 : num 6 12 6 5 3.75 12 3.75 5 3 3.75 ...
## $ samplingCount_risky: int NA 14 NA 9 NA 14 NA 9 NA 14 ...
## $ samplingCount_safe : int NA 10 NA 17 NA 10 NA 17 NA 9 ...
## $ switchCount : int NA 3 NA 2 NA 3 NA 2 NA 3 ...
## $ safeChoice : num 1 0 1 0 0 0 0 0 0 1 ...
## $ rt : num 1.693 0.573 1.693 0.411 0.734 ...
It works when the object is an entire dataframe, but it works equally well if we only want to examine one specific variable as well:
str(df$safeChoice)
## num [1:136] 1 0 1 0 0 0 0 0 0 1 ...
But unfortunately we see an example of R mis-structuring a variable. We can easily coerce the variable into the appropriate structure, though:
df$safeChoice <- as.factor(df$safeChoice)
Here, we are telling R to treat the safeChoice variable within the df dataframe as a factor, using the as.factor() command, and then overwriting the variable, saving it as itself.
Although I noted at the beginning of Section 0.6.1 that each variable is either numeric or a factor, in reality, your R likely did not read in your variables as such. In this exercise: 1.) note the structure of your dataframe columns,
2.) compare the structure of each variable to what is listed in our reference list,
3.) coerce any that do not match to the appropriate structure.
NOTE: The command to coerce non-numeric variables to numeric ones is is.numeric(). Don’t worry about converting int to num.
'
'
Although thus far, we’re almost exclusively talked about data in the context of dataframes, we can store values as free-floating variables that exist outside of dataframes as well. For example, let’s create a variable called x and give it a value of 50.
x <- 50
Now that we’ve created that variable, there are at least four ways to check the value of this variable in R studio:
1.) You can type ‘x’ into the console, press enter, and it should produce the current value of x. 2.) You can include ‘x’ as a command in your script. Other languages require that you include a print() argument; R doesn’t. 3.) You can place the statement you wrote to create the variable within parentheses. 4.) You can see/click your variables within your Environment Window under the Values heading.
Eagle-eyed coders might notice that we already have at least one other variable under the Values heading, that being our Path variable. Much like those contained within a dataframe, free-floating variables can take both numeric or alphabetic values, and our Path is a case of the latter. We should clearly see our value in the Environment Window, but let’s go ahead and type ‘x’ into the Console Window to check that our variable was correctly specified. If correct, you should see this:
A view of what you should see in the console.
Much like how we overwrote to change the structure, we can overwrite a variable easily and use the second option to check its value.
x <- 20
x
## [1] 20
We don’t really want to keep x cluttering our environment, though, so we’re going to use the rm() command to remove it. Notably, this won’t work for variables or data contained within a dataframe, just those free-floating ones in your Environment.
rm(x)
Let’s try to make this a little more practical. Let’s say converting the numeric values of df$safeChoice was not enough; let’s say we wanted to replace those with labels “Safe” and “Not Safe”. We could simply do this one at a time by asking R for the value of each cell, one by one by one by one…
df[1,11]
## [1] 1
## Levels: 0 1
Here, I just used R’s coordinate system to call for the value of the cell that’s in the first row and the eleventh column. I see that it says it is 1, which I know to be “Not Safe”. As such, I could try to manually replace this value like this:
df[1,11] <- as.factor("Not Safe")
df[1,11]
Not only would that take forever and is it more susceptible to human error, but R also gives us an error message. If we were to try to force R to change this value, it will result in data loss, as the 1 changes to NA. This is because it is expecting that the only possible values this column can take are 0 and 1. We see that, while analytically valuable, factor structures can sometimes be overly restrictive during the cleaning phase.
Given that, the best way to correct this is to restructure this variable as a character and call our new friend Conditional Statements to do the work for us. Here, we are essentially telling R to go row by row within the safeChoice column and check each time whether the value that is currently there is equal to 1 or 0. If the former, change the value to “Not Safe”, if the latter, change the value to “Safe”.
#I'm copying this column before I modify it for use in a later analysis. Ignore for now.
df$safeChoice_num <- df$safeChoice
df$safeChoice <- as.character(df$safeChoice)
df$safeChoice[df$safeChoice == 1] <- "Not Safe"
df$safeChoice[df$safeChoice == 0] <- "Safe"
head(df$safeChoice)
## [1] "Not Safe" "Safe" "Not Safe" "Safe" "Safe" "Safe"
On top of being able to overwrite currently existing columns in R, we can leverage data in currently existing columns to create new columns of new data rather simply. For example, let’s say we wanted to create a difference score, that is, the value of one variable subtracted from the value of another variable. We could easily do that by simply naming the first and second variables, placing a subtraction operator between them, and then assigning that value to a new variable name contained within that dataframe.
df$NewVariableName <- df$ExistingVariable1 - df$ExistingVariable2
R will automatically create a new column with that new name and append it to the end of your dataframe. The value for each row in that new column will be equal to the value of Variable 1 minus Variable 2 within that same row.
Let’s calculate values for a new column within df called EV_Diff. It should be equal to the value of the first Expected Value minus the second Expected Value. Do the same for Sample_Diff, equal to the value of the Safe Sample Count minus the Risky Sample Count. Note that the order you place these variables in may be important.
'
'
EV_Diff could be a helpful variable later on, but there’s a slight problem. We might not really necessarily care whether EV1 was the larger amount of EV2 was the larger amount, we just want to know the pure magnitude of the difference. You could maybe imagine a scenario where EV1 = 10 and EV2 = 5, as well as EV1 = 5 and EV2 = 10. In the former, EV_Diff = 5 and in the latter EV_Diff = -5. These values will be treated differently by R, but we want R to treat them the same. As such, we want to create a new variable that captures the absolute value of EV_Diff, which we will call EV_DiffAbs. The absolute value command is abs().
'
'
That was a big lead up to the main event, but basically familiarity with R syntax and conventions is really important before we start conducting analyses in R. Hopefully you feel like you have a half-decent grasp on what the environment contains, how to use the console to enter commands, and where to look when you need help. We’re now going to get into analyses. We’re going to start relatively simple with analytic approaches that utilize qualitative, categorical variables as predictors and get progressively more advanced to analyses that use continuous, numeric, quantitative predictors. At this juncture, let me reward my captive audience with a cute seasonally-appropriate photo of my dog.
Sadie in a Christmas Sweater, circa 2020.
Hopefully, when you heard terms like “Categorical” and “Qualitative”, you immediately thought about factors. These analyses are going to be using factor variables to test hypotheses and explore questions within out dataset. We’ll be looking at how to code Chi Squares, T-Tests, and ANOVAs in this section, as well as taking a sizable detour to explain data visualization using ggplot2. You don’t need to know much about those tests; I’ll try to explain enough for you to keep up, but if you want to read more statistical theory, click each of their names to follow the hyperlinks.
A Chi-Square Test can be used when both the predictor and outcome, or independent and dependent variables, respectively, are categorical. They involve checking if the frequencies in our outcome or dependent categories match what we would expect to see if there were no differences between our predictor or independent categories. Within the context of our data, we could ask a question like whether people choice the safe option more frequently within the experience-based decision-making task relative to the description-based decision-making task.
QUESTION: Does how sampling probability is learned lead people to choose safe choices more frequently?
HYPOTHESIS: Experience sampling will be associated with a higher frequency of safe choices.
RELEVANT VARIABLES: Dependent: safeChoice (Factor) Independent: Condition (Factor)
ANALYSIS: Chi-Square
To start, what we be good to know is simply what the frequency of each category for the variables is. We are going to deviate a little bit from what we’d seen in the past and use syntax and functions contained within the dplyr package. dplyr is a bit of a swiss army knife in that it actually contains many other packages that are very well established in R and are very useful. I personally have trouble keeping the syntax straight, so I only use it when I have to, but I can’t deny that it’s exceptionally useful (and I’m the only person I know that doesn’t like it). I share that only to acknowledge that I’m not the best resource on it, but here’s another really helpful .pdf that highlights the primary functions of dplyr syntax.
(prop <- with(df, table(Condition, safeChoice)) %>%
prop.table())
## safeChoice
## Condition Not Safe Safe
## Descriptive 0.08823529 0.41176471
## Experiential 0.08823529 0.41176471
Let’s note that the unusual %>% operator functions to pipe data from one argument into another within the dplyr framework. So what this is doing is telling R with data that we are pulling from the df dataframe, create a table consisting of all of the values contained within the Condition and safeChoice columns. Then, pipe that data into a new table consisting of just the proportions of the frequencies of the different categories contained within those two columns, and save that object as “prop”.
So what we are seeing is, out of all of the trials this participant saw, the proportion of which they either made Safe or Risky Decisions for each of the conditions. We often think of these as percentages, though, so let’s reformat them a little and round off the extra digits. We can do that simply with the round function.
round(prop,2)
## safeChoice
## Condition Not Safe Safe
## Descriptive 0.09 0.41
## Experiential 0.09 0.41
The first argument notes the object; the second the number of digits past the decimal place that are significant. We don’t overwrite this object, though. We want our values as accurate as possible for the next step. Interestingly enough, there are seemingly no differences by condition.
Following prop’s creation, we then are going to feed those values into a new object to get the final totals. Note again here that we are putting parentheses around our statement. This is so that when R is done creating the object, it will just automatically it them for us, rather than us having to ask it in a separate statement.
(total <- prop * length(rownames(df)))
## safeChoice
## Condition Not Safe Safe
## Descriptive 12 56
## Experiential 12 56
Once again we see the total number of trials wherein some chose either the risky or safe option are identical across conditions. Does not bode well for our hypothesis. Even so, it’s nice to see ahead of time what we’re working with. Now let’s run our chi square test. It’s fairly simple. We note our x variable and our y variable and then run the command. We are saving it as an object chi…
(chi <- chisq.test(x = df$Condition,
y = df$safeChoice))
##
## Pearson's Chi-squared test
##
## data: df$Condition and df$safeChoice
## X-squared = 0, df = 1, p-value = 1
…that way, at a later time, we can call the exact same object and see the results again.
chi
##
## Pearson's Chi-squared test
##
## data: df$Condition and df$safeChoice
## X-squared = 0, df = 1, p-value = 1
…and as suspected, not even the hint of a significant difference was present (because no differences were present at all).
This is by no means, necessary, but I always like to write my journal-ready summary statement within my code so I can quickly reference it in the future and remember how I had interpreted the results in the past.
A Pearson's Chi Squared test with Yates' continuity correction determined that safe choices [freq = 112, prop = .824] were not reported more frequently than risky choices [freq = 24, prop = .176] for either experiential- or descriptive-based decision-making[x-squared = 0, df = 1, p < 1.000].
Let’s run a second Chi Square Test, this time examining whether Problem Type has any influence on the frequency of Safe Choices. While it’s perhaps not very a theoretically well-grounded question, we can nonetheless code it much like we had for the last question we asked. Make sure to produce a proportion table and a chi-square test. Don’t worry about the summary statement.
'
'
While a Chi Square Test is great, verbalizing the relationship can only go so far to communicate how these variables are related. Where R really excels, relative to other languages, is visualizing relationships in plots. ggplot2 allows us to build visually-pleasing graphics one layer at a time". This chunk below should probably look big and confusing, and that’s fine for now. We’re going to run it so we can see what the end product looks like. Then we’re going to talk through these lines and I’ll show you that it really looks much more intimidating than it is.
ggplot(data = df, aes(x = Condition, color = safeChoice, fill = safeChoice)) +
geom_bar() +
scale_x_discrete("Condition", breaks = c("Descriptive", "Experiential")) +
scale_y_continuous(breaks = c(0,20,40,60,80)) +
labs(title = "Differences in Risk Choices by Condition",
subtitle = "",
x =NULL,
y ="Frequency",
caption = "p > 0.05") +
scale_color_brewer(palette = "Dark2") +
scale_fill_brewer(palette = "Set2") +
coord_cartesian(ylim=c(0.0, 80.0)) +
theme_classic() +
theme(plot.title = element_text(face="bold", size=13, hjust = 0.5)) +
theme(plot.subtitle = element_text(size = 10, hjust = 0.5, face = "italic")) +
theme(plot.caption = element_text(size = 8, hjust = 0.0, face = "italic")) +
theme(axis.title = element_text(size = 16)) +
theme(axis.text.x = element_text(size = 14, color = "Black")) +
theme(axis.text.y = element_text(size = 12, color = "Black"))
Furthermore, once you write a good ggplot script once, you’ll find that you’ll rarely have to rewrite it in the future. Most of the commands within can be applied regardless of what type of plot you want to create; you just make small little tweaks to change the output.
ggplot2 is widely considered the premier data visualization tool for R. As such, the arguments used by the ggplot command are very important to understand. The volume of customization options is frankly overwhelming, but they allow us to produce the beautiful figures that we often see associated with R and are substantially less overwhelming when standardized in a template format. I certainly don’t memorize all of the settings I’ve found to look best and reproduce them every time I need to plot new data. Instead, I create a template that I can easily copy and paste or recall, and then focus on changing only the details relevant to the data at hand. I’m going to share my standard ggplot template and highlight what all of the arguments do. By the end, you should feel knowledgeable enough to start tweaking the aesthetic qualities of your own plots to find what you feel looks best, and maybe even explore the bevy of additional arguments which I certainly won’t have time to cover within the ggplot documentation.
Relevant xkcd
=== Plot Creation ===
To start, we’re just going to do a basic plot with both an x and y axis. To make it simple, I’m going to use variables that have numeric values, so the x axis will contain the absolute difference in expected values and the y axis will contain the range of possible reaction times, but the actual data doesn’t matter much here. Instead, we’re going to focus on just about everything but the data.
ggplot(data = df, aes(x = EV_DiffAbs, y = rt))
This is just about the most basic plot we could run in ggplot. We start by defining the dataframe from which we are pulling our variables (df), and then we tell R which two variables will define the aesthetics of this plot (aes(x = EV_DiffAbs, y = rt)). As you can see, that’s enough for R to create a basic 2d plane on which we could visualize data. R automatically places the variable we defined as X on the x axis and labels it, and then populates the Y axis as well. Both axes are given reasonable ranges, based on the range of the data.
=== Adding Geoms ===
However, if we want to see the actual data, we need to define what type of plot (i.e., line, bar, scatter, etc.) we want to see. In order to add more layers to a ggplot, we need to use an addition sign:
ggplot(df, aes(x = EV_DiffAbs, y = rt)) +
geom_smooth(method="lm", alpha = .25, size = 1.5)
Here, we kept the first line the same, and we’re adding a command for ggplot to produce a geometric shape (hence, “geom_”) of the type “smooth” (or a smoothed regression line). There are dozens of possible geoms one could use in ggplot. For reference, you can either check the supplementary infographic I supplied or enter “?ggplot” in the console to review the documentation. Following the command, you’ll see arguments that define the method of smoothing, how dark or light the standard error cloud is around the regression line, and how thick the regression line should be, respectively. These are aesthetic choices and will obviously change quite a bit depending upon the plot, so we won’t dive much deeper into these. For now, I want to move on to commands that I almost always use no matter what type of plot I’m producing.
=== Labeling Plots ===
The above plot might be perfectly fine for you and your team, who have been looking at the data for months, to understand the relationship you’re studying, but adding more descriptive labels would go a long way to make these plots more accessible to others. There are a million different commands that will alter the labels of the axes, but I’m partial to this approach, which allows us to modify the title and add subtitles and captions as well:
ggplot(df, aes(x = EV_DiffAbs, y = rt)) +
geom_smooth(method="lm", alpha = .25, size = 1.5) +
labs(title = "Expected Value Difference Predicts Decision Reaction Time",
subtitle = "Absolutle Difference in Expected Value Has No Effect on Reaction Time.",
x = "Absolute Difference in Expected Value",
y ="Reaction time (in seconds)",
caption = "p > 0.05: N.S. \np < 0.05: * \np < 0.01: ** \np < 0.001: ***")
Notice, once again, that we string together multiple layers through the use of addition signs, but within the context of a single function, commas are still used to separate arguments. The labs function allows us to designate what we want to appear in the title and axes. I also like to take advantage of the subtitle and caption arguments to give a very brief, easy to understand interpretation of what I want my readers to take a way from the plot, and to note important information in interpreting the plot. As you can see, character text must be bookended with quotation marks. I use “backslash + n” within the caption to let R know I want everything to the right to appear on a new line.
=== Specifying Plot Axes Range ===
This looks better, certainly, but let’s say participants only had 2 seconds to respond to the decisions. This axis would be a little disingenuous, because it would suggest the range was actually much smaller and artificially make my relationship look larger. My Y axis in that case should really range from 0 to 2, for the amount of time participants had to respond. The coord_cartesian function is by far the best way to manipulate the displayed range of your axes without actually modifying the data (Some other functions will exclude datapoints that sit outside of the range you define, meaning they won’t be included in your regression line). We simply define the lower and upper limits of our x (xlim) and/or y (ylim) axis:
ggplot(df, aes(x = EV_DiffAbs, y = rt)) +
geom_smooth(method="lm", alpha = .25, size = 1.5) +
labs(title = "Expected Value Difference Predicts Decision Reaction Time",
subtitle = "Absolutle Difference in Expected Value Has No Effect on Reaction Time.",
x = "Absolute Difference in Expected Value",
y ="Reaction time (in seconds)",
caption = "p > 0.05: N.S. \np < 0.05: * \np < 0.01: ** \np < 0.001: ***") +
coord_cartesian(ylim=c(0, 2))
=== Using Themes ===
Great! Now the data is all technically correct, but I really don’t like that ugly gray background. Let’s change that with the host of functions that come with the ggtheme package. I believe that base ggplot2 may accept some theme commands without needing ggtheme, but ggtheme, at the least, expands the options we have to customize our plots. We’ll start with my usual favorite, theme_classic:
ggplot(df, aes(x = EV_DiffAbs, y = rt)) +
geom_smooth(method="lm", alpha = .25, size = 1.5) +
labs(title = "Expected Value Difference Predicts Decision Reaction Time",
subtitle = "Absolutle Difference in Expected Value Has No Effect on Reaction Time.",
x = "Absolute Difference in Expected Value",
y ="Reaction time (in seconds)",
caption = "p > 0.05: N.S. \np < 0.05: * \np < 0.01: ** \np < 0.001: ***") +
coord_cartesian(ylim=c(0, 2)) +
theme_classic()
## `geom_smooth()` using formula 'y ~ x'
The “theme_” commands do an overhaul on the presentation of the plot, and make many changes all at once. Let’s take a look what another few look like
ggplot(df, aes(x = EV_DiffAbs, y = rt)) +
geom_smooth(method="lm", alpha = .25, size = 1.5) +
labs(title = "Expected Value Difference Predicts Decision Reaction Time",
subtitle = "Absolutle Difference in Expected Value Has No Effect on Reaction Time.",
x = "Absolute Difference in Expected Value",
y ="Reaction time (in seconds)",
caption = "p > 0.05: N.S. \np < 0.05: * \np < 0.01: ** \np < 0.001: ***") +
coord_cartesian(ylim=c(0, 2)) +
theme_dark()
## `geom_smooth()` using formula 'y ~ x'
ggplot(df, aes(x = EV_DiffAbs, y = rt)) +
geom_smooth(method="lm", alpha = .25, size = 1.5) +
labs(title = "Expected Value Difference Predicts Decision Reaction Time",
subtitle = "Absolutle Difference in Expected Value Has No Effect on Reaction Time.",
x = "Absolute Difference in Expected Value",
y ="Reaction time (in seconds)",
caption = "p > 0.05: N.S. \np < 0.05: * \np < 0.01: ** \np < 0.001: ***") +
coord_cartesian(ylim=c(0, 2)) +
theme_minimal()
ggplot(df, aes(x = EV_DiffAbs, y = rt)) +
geom_smooth(method="lm", alpha = .25, size = 1.5) +
labs(title = "Expected Value Difference Predicts Decision Reaction Time",
subtitle = "Absolutle Difference in Expected Value Has No Effect on Reaction Time.",
x = "Absolute Difference in Expected Value",
y ="Reaction time (in seconds)",
caption = "p > 0.05: N.S. \np < 0.05: * \np < 0.01: ** \np < 0.001: ***") +
coord_cartesian(ylim=c(0, 2)) +
theme_solarized()
## `geom_smooth()` using formula 'y ~ x'
Quite a range! There are a ton of other options which you can review in the ggthemes documentation.
=== Modifying Text Elements ===
For the sake of time, we’re going to move onto the last thing I usually do, and that’s modify the appearance of the text using theme commands, It drives me nuts that the title is off-center and that the “p”’s in the caption are misaligned, so we’re going to run a series of commands to correct this.
ggplot(df, aes(x = EV_DiffAbs, y = rt)) +
geom_smooth(method="lm", alpha = .25, size = 1.5) +
labs(title = "Expected Value Difference Predicts Decision Reaction Time",
subtitle = "Absolutle Difference in Expected Value Has No Effect on Reaction Time.",
x = "Absolute Difference in Expected Value",
y ="Reaction time (in seconds)",
caption = "p > 0.05: N.S. \np < 0.05: * \np < 0.01: ** \np < 0.001: ***") +
coord_cartesian(ylim=c(0, 2)) +
theme_classic() +
theme(plot.title = element_text(face="bold", size=13, hjust = 0.5)) +
theme(plot.subtitle = element_text(face = "italic", size = 10, hjust = 0.5)) +
theme(plot.caption = element_text(face = "italic", size = 8, hjust = 0.0)) +
theme(axis.title = element_text(size = 12)) +
theme(axis.text.x = element_text(size = 14, color = "Black")) +
theme(axis.text.y = element_text(size = 14, color = "Black"))
As you can see, we added several new lines that all utilize the theme function. Within each line, we then specify what element of the plot we are targeting, whether it be plot subtitle, plot caption, or anything else. We then note that we are specifically modifying the text of each of these elements. Then, just like with the geom_ series of functions, there are a massive number of arguments which we could utilize to change the color of the text, the orientation, the size, the face type, the font, and the centering. All of these are also well documented and there are a ton of blogs which can walk you through these. One note on the hjust (centering) argument: A value of 0 left-justifies the text (As we see with the caption), a value of 1 right justifies text, and a value of 0.5 centers it (as we see in the title).
One last note: later layers take precedence. So, if you give ggplot to arguments which affect the same plot element in a conflicting way (maybe you want tell it you want the X axis labeled “Bananas” early on, and then “Apples” at the end), ggplot will often not let you know that these arguments are in conflict, and just accept whatever argument came last (so your X axis will be named Apples). We’re really just sort of scratching the surface of what we can do in R when it comes to visualizations, but hopefully this provides a good foundation for what is still to come. The next two modules will be much quicker and more focused.
Let’s see you give it a show. Modify the code of the plot below so that: 1.) the title and subtitle are right justified
2.) the text of both axes appear blue in color
3.) the p value key now says “I’m, like, a total R whiz.”
ggplot(df, aes(x = EV_DiffAbs, y = rt)) +
geom_smooth(method="lm", alpha = .25, size = 1.5) +
labs(title = "Expected Value Difference Predicts Decision Reaction Time",
subtitle = "Absolutle Difference in Expected Value Has No Effect on Reaction Time.",
x = "Absolute Difference in Expected Value",
y ="Reaction time (in seconds)",
caption = "p > 0.05: N.S. \np < 0.05: * \np < 0.01: ** \np < 0.001: ***") +
coord_cartesian(ylim=c(0, 2)) +
theme_classic() +
theme(plot.title = element_text(face="bold", size=13, hjust = 0.5)) +
theme(plot.subtitle = element_text(face = "italic", size = 10, hjust = 0.5)) +
theme(plot.caption = element_text(face = "italic", size = 8, hjust = 0.0)) +
theme(axis.title = element_text(size = 12)) +
theme(axis.text.x = element_text(size = 14, color = "Black")) +
theme(axis.text.y = element_text(size = 14, color = "Black"))
Ideally, you should get something like this.
## `geom_smooth()` using formula 'y ~ x'
A T-Test can be used when both the predictor variable consists of two categorical options and the outcome or dependent variable is numeric in value. A T-Test tells you how significant the differences between these categories or groups are. In other words, it lets you know if the differences between the means of two groups could have observed by chance. We could imagine a situation where an evil teacher told half of the class before a test the right chapter to study from and told the other half of the class the wrong chapter to study from. The two categories or groups might be Right Chapter and Wrong Chapter and the outcome variable would be Test Score. Using a T-Test, we could determine whether studying from the right materials produces higher test scores. Within the context of our data, we could ask a question like whether people make their choices faster when they are considering decisions that only deal with losses or gains, rather than decisions that include both losses and gains.
QUESTION: Do people make choices faster when deciding within a single domain versus domains including both losses and gains?
HYPOTHESIS: People make choices faster when deciding within a single domain.
RELEVANT VARIABLES: Dependent: rt (numeric) Independent: ProbType (Factor)
ANALYSIS: Two-Sample T-Test
We are going to start similarly to how we did last time by first taking a look at the data. The skimr package is near and dear to my heart. It provides a quick way to see the distribution, or the shape, of the data as well as some details about it. A lot of this is beyond the context of a coding class and more important for stats, but I wanted to demonstrate how easy this package makes things.
skim(df)
| Name | df |
| Number of rows | 136 |
| Number of columns | 16 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| factor | 5 |
| numeric | 10 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| safeChoice | 0 | 1 | 4 | 8 | 0 | 2 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| PID | 0 | 1 | FALSE | 1 | A00: 136 |
| Condition | 0 | 1 | FALSE | 2 | Des: 68, Exp: 68 |
| Run | 0 | 1 | FALSE | 2 | Run: 68, Run: 68 |
| ProbType | 0 | 1 | FALSE | 3 | RR: 48, MD: 44, SD: 44 |
| safeChoice_num | 0 | 1 | FALSE | 2 | 0: 112, 1: 24 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| order | 0 | 1.0 | 8.46 | 5.21 | 0.00 | 4.00 | 8.50 | 13.00 | 17.00 | ▇▆▇▆▇ |
| EV1 | 0 | 1.0 | 6.72 | 9.08 | -7.00 | -0.80 | 5.00 | 12.00 | 27.20 | ▆▇▅▁▂ |
| EV2 | 0 | 1.0 | 6.55 | 9.46 | -8.00 | -1.25 | 5.00 | 12.00 | 29.50 | ▆▇▃▂▂ |
| samplingCount_risky | 68 | 0.5 | 12.38 | 4.72 | 3.00 | 9.00 | 12.00 | 16.00 | 23.00 | ▅▇▇▇▂ |
| samplingCount_safe | 68 | 0.5 | 9.85 | 3.53 | 6.00 | 7.00 | 8.50 | 12.00 | 20.00 | ▇▃▃▁▁ |
| switchCount | 68 | 0.5 | 1.79 | 0.59 | 1.00 | 1.00 | 2.00 | 2.00 | 3.00 | ▃▁▇▁▁ |
| rt | 0 | 1.0 | 0.56 | 0.31 | 0.29 | 0.39 | 0.45 | 0.58 | 1.69 | ▇▂▁▁▁ |
| EV_Diff | 0 | 1.0 | 0.17 | 1.57 | -2.30 | -0.40 | 0.20 | 0.65 | 10.30 | ▇▆▁▁▁ |
| Sample_Diff | 68 | 0.5 | -2.53 | 7.18 | -15.00 | -8.00 | -3.50 | 2.00 | 14.00 | ▅▇▇▃▂ |
| EV_DiffAbs | 0 | 1.0 | 0.90 | 1.30 | 0.20 | 0.24 | 0.50 | 1.20 | 10.30 | ▇▁▁▁▁ |
The describe function from the psych package provides complementary information to skimr. However, there are a number of packages which contain a function called describe. As such, it’s important to denote which package we are pulling this command from. We can do that by first naming the package and putting two colons (::) between the package name and the command name. The command works equally well on whole dataframes, or columns from dataframes.
psych::describe(df$rt)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 136 0.56 0.31 0.45 0.5 0.11 0.29 1.69 1.41 2.06 3.84 0.03
We can conduct a whole host of other visualizations, too. We’re already running short on time, so I’m not going to dive too deeply into the theory behind these. In brief: Each statistical test assumes certain things about the data. If those assumptions aren’t met, the reliability of the test is questionable; the test may still produce an answer, but it might not be of value. You can maybe your phone and your laptop. Both might charge using a USB-C connection, but the laptop requires much greater voltage than your phone (which is why it usually has the box in the chain). Using your phone’s USB-C cable might charge your laptop, but it might be unreliable or might cause lasting damage.
I don’t know why I thought a visual would drive this point home more…
… but here we are.
In the same way, if the data does not meet the specifications that a test requires, the results could be unreliable or flat out incorrect. The visualizations below allow us to examine some of these assumptions, but you can come back to them at a later date when you’ve developed a little more statistical acumen.
#Histogram for T-Tests
hist(df$rt)
NOTE: The QQ Plot and Box Plot here use syntax similar to what we saw in ggplot because they are in the gg family.
#QQ Plots for T-Tests
ggqqplot(df,
"rt",
ggtheme = theme_bw()) +
facet_grid(~ProbType,
labeller = "label_both")
# Box Plot for T-Tests
ggboxplot(df,
x = "ProbType",
y = "rt",
palette = "jco",
short.panel.labs = FALSE)
NOTE: The normality and outlier tests below here use dplyr syntax, which, again, you can check here for references on how to interpret.
# Outlier Testing for T-Tests
df %>%
group_by(ProbType) %>%
identify_outliers(rt)
# Shapiro-Wilk Test for Normality for T-Tests
df %>%
group_by(ProbType) %>%
shapiro_test(rt)
# Homogeneity of variance for T-Tests
df %>%
levene_test(rt ~ ProbType)
Okay, assuming that we’ve met our assumptions and decided that a t-test is an appropriate means of measuring our question, let’s run the actual t-test. It’s a little more complicated than our chi-square test. For one, we only want to compare two levels of a factor that contains three levels in total. As such, we need to use conditional statements again to specify our variables. What we are comparing here are the mean values of reaction time for trials where the type of problem was single domain versus mixed domain. As such, we are going to specify we want to see reaction time when ProbType == SD and when ProbType == MD. We next have an argument which asks us whether this study is a within-subjects or a between-subjects design. This question is within-subjects, since each participant provided data for both single and mixed domains, so we mark that as true. Lastly, R is asking us to define our alternative hypothesis, which is a little beyond the scope of this review, so you will have to take my word that “two.sided” is the right call. Lastly, when we look at T-Tests, standard deviations are very important, but the t.test() function won’t automatically generate those. We are using the sd() function to capture the standard deviation of reaction action, and we’re adding the argument (na.rm) that tells R to ignore any row that has a value of N/A.
(ttest<- t.test(x = df$rt[df$ProbType == "SD"],
y = df$rt[df$ProbType == "MD"],
paired = TRUE,
alternative = "two.sided"))
##
## Paired t-test
##
## data: df$rt[df$ProbType == "SD"] and df$rt[df$ProbType == "MD"]
## t = 1.2255, df = 43, p-value = 0.227
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.05966682 0.24452466
## sample estimates:
## mean of the differences
## 0.09242892
sd(df$rt, na.rm = T)
## [1] 0.3082921
We are having a really rough day with these hypotheses. We could write these results ourselves like we did last time, but let’s take a moment to pull out a nifty little tool from the report package. Rather than diving deeper into self-doubt and misery by systematically documenting the remnants of what was once a perfectly reasonable hypothesis, report will save us the trouble and summarize the results of our test (albeit a little imperfectly) for us.
report(ttest)
## Effect sizes were labelled following Cohen's (1988) recommendations.
##
## The Paired t-test testing the difference between df$rt[df$ProbType == "SD"] and df$rt[df$ProbType == "MD"] (mean of the differences = 0.09) suggests that the effect is positive, statistically not significant, and very small (difference = 0.09, 95% CI [-0.06, 0.24], t(43) = 1.23, p = 0.227; Cohen's d = 0.18, 95% CI [-0.12, 0.49])
Let’s run a t-test examining differences between conditions regarding reaction time. Use the report function to write up the results.
'
'
An ANOVA, or Analysis of Variance, can be used when both the predictor variable or variables consist of two or more categorical options and the outcome or dependent variable is numeric in value. Much like a T-Test, ANOVA tells you how significant the differences between these categories or groups are. The advantage over T-tests is that we can compare multiple groups or categories in one analysis. We could revisit our last horrible example and imagine that the evil teacher tells one group right chapter to study from, one group the wrong chapter to study from, and one group to not study at all. An ANOVA test will tell us whether any of these three groups are different from one another (but not necessarily which specific groups are different from one another). be Test Score. Within the context of our data, we could ask more complex questions than we had in the past, like whether differences exist in reaction times among the different combinations of conditions and problem types.
QUESTION: Are there differences in reaction times among the different combinations of conditions and problem types?
HYPOTHESIS: Differences will exist across the different combinations.
RELEVANT VARIABLES: Dependent: rt (numeric) Independent: ProbType (Factor) Independent: Condition (Factor)
ANALYSIS: ANOVA
Much like the assumptions for T-Tests, we will not dive deep into the assumption tests we run for ANOVA. I wanted to include them here to demonstrate how adding an additional predictor changes these tests. We still run the same tests, we just add a new dimension by including the other variables.
#QQ Plots for ANOVAs
ggqqplot(df, "rt", ggtheme = theme_bw()) +
facet_grid(Condition~ProbType, labeller = "label_both")
# Box Plot for ANOVAs
ggboxplot(df,
x = "ProbType",
y = "rt",
palette = "jco",
facet.by = "Condition",
short.panel.labs = FALSE)
# Outlier tests for ANOVAs
df %>%
group_by(ProbType, Condition) %>%
identify_outliers(rt)
# Shapiro-Wilk Test for Normality for ANOVAs
df %>%
group_by(ProbType, Condition) %>%
shapiro_test(rt)
# Homogeneity of variance for ANOVAs
df %>%
group_by(Condition) %>%
levene_test(rt ~ ProbType)
Pay close attention to the formatting of the syntax here. It is the standard way in which we specify statistical models in R, whether for regression, ANOVA, hierarchical modeling etc.
aov <- aov(rt ~ ProbType * Condition, data = df)
The Base R command we use to specify that this is an ANOVA model is aov(). We then place our outcome/criterion/dependent variable next, followed by a tilde (~). Following the tilde comes all of the predictor/independent variables. Notice we use an asterisk (*) between the two terms. This is equivalent to writing out the model as this:
aov <- aov(rt ~ ProbType + Condition + ProbType:Condition, data = df)
It analyzes the data for the main effects of Problem Type and Condition, but also the interaction between the two (represented by the two terms joined by a colon(:)). Once the model has been specified, we note its end with a comma and tell the model where the data exists. Now if we run either of those models, we’ll see a new object by the name aov. But when we call that object…
aov
## Call:
## aov(formula = rt ~ ProbType + Condition + ProbType:Condition,
## data = df)
##
## Terms:
## ProbType Condition ProbType:Condition Residuals
## Sum of Squares 0.383434 0.533005 1.067094 10.847409
## Deg. of Freedom 2 1 2 130
##
## Residual standard error: 0.2888626
## Estimated effects may be unbalanced
… the information isn’t really formatted in a way that’s immediately meaningful or understandable. We typically look towards metrics like p values or means to understand ANOVA results and those are not present here. In order to see those, we need to summarize the ANOVA object.
summary(aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## ProbType 2 0.383 0.1917 2.298 0.10457
## Condition 1 0.533 0.5330 6.388 0.01269 *
## ProbType:Condition 2 1.067 0.5335 6.394 0.00225 **
## Residuals 130 10.847 0.0834
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The report function works equally well on ANOVA objects as well.
report(aov)
## The ANOVA (formula: rt ~ ProbType + Condition + ProbType:Condition) suggests that:
##
## - The main effect of ProbType is statistically not significant and small (F(2, 130) = 2.30, p = 0.105; Eta2 (partial) = 0.03, 90% CI [0.00, 0.09])
## - The main effect of Condition is statistically significant and small (F(1, 130) = 6.39, p = 0.013; Eta2 (partial) = 0.05, 90% CI [5.57e-03, 0.12])
## - The interaction between ProbType and Condition is statistically significant and medium (F(2, 130) = 6.39, p = 0.002; Eta2 (partial) = 0.09, 90% CI [0.02, 0.17])
##
## Effect sizes were labelled following Field's (2013) recommendations.
The anova_stats() function from the sjstats package will give us some more details about our model, like effect sizes. Unfortunately, R won’t recognize this function without specifying the package it comes from, for reasons I don’t understand, so that’s why I’m using the double colon (::) again to note which package it comes from. The function will automatically round values to 3 digits as well..
sjstats::anova_stats(aov,
digits = 3)
## term | df | sumsq | meansq | statistic | p.value | etasq | partial.etasq | omegasq | partial.omegasq | epsilonsq | cohens.f | power
## ---------------------------------------------------------------------------------------------------------------------------------------------------
## ProbType | 2 | 0.383 | 0.192 | 2.298 | 0.105 | 0.030 | 0.034 | 0.017 | 0.019 | 0.017 | 0.188 | 0.469
## Condition | 1 | 0.533 | 0.533 | 6.388 | 0.013 | 0.042 | 0.047 | 0.035 | 0.038 | 0.035 | 0.222 | 0.715
## ProbType:Condition | 2 | 1.067 | 0.534 | 6.394 | 0.002 | 0.083 | 0.090 | 0.070 | 0.073 | 0.070 | 0.314 | 0.903
## Residuals | 130 | 10.847 | 0.083 | | | | | | | | |
This next block of code is pulled directly from a post I found on how to produce 95% confidence intervals on Stack Overflow. It is a custom function. Much like developers create functions or commands that exist in packages, we all could create custom functions to replace cumbersome or lengthy procedures we plan to use often. This custom function will calculate the standard deviation, standard error, and confidence intervals for the terms in our ANOVA model. We don’t really need to understand what’s under the hood right now, and we don’t need to make any modifications. We can just know that we are creating a new function called CustomSummary which we can use later.